Abstract. Given a text T and a pattern P, the order-preserving matching
problem is to find all substrings in T which have the same relative
orders as P. Order-preserving matching has been an active research area
since it was introduced by Kubica et al. [13] and Kim et al. [11]. In
this paper we present two algorithms for the multiple order-preserving
matching problem, one of which runs in sublinear time on average and
the other in linear time on average. In our experiments, both algorithms
run much faster than the previous algorithms.

1 Introduction 

Given a text T and a pattern P, the order-preserving matching problem is to
find all substrings in T which have the same relative orders as P. For example,
given T = (10,15,20,25,15,30,20,25,30,35) and P = (35,40,30,45,35), P has
the same relative orders as the substring T' = (20,25,15,30,20) of T. In T'
(resp. P), the first character 20 (resp. 35) is the second smallest number, the
second character 25 (resp. 40) is the third smallest number, the third character
15 (resp. 30) is the smallest number, and so on. See Fig. 1. This problem is
naturally generalized to the problem of finding multiple patterns. The
order-preserving matching for a single pattern will be called the single
order-preserving matching, and the one for multiple patterns the multiple
order-preserving matching. In this paper we are concerned with the multiple
order-preserving matching problem.

Order-preserving matching was introduced by Kubica et al. [13] and Kim et
al. [11], where Kubica et al. [13] defined order relations by order isomorphism
of two strings, while Kim et al. [11] defined them explicitly by the sequence of
rank values, which they called the natural representation. They both proposed
$O(n + m \log m)$ time solutions for the single order-preserving matching based on
the Knuth-Morris-Pratt algorithm, where n is the length of the text and m is
the length of the pattern. Kim et al. [11] also proposed an $O(n \log M)$ time
algorithm for the multiple order-preserving matching based on the Aho-Corasick
algorithm, where M is the sum of lengths of all the patterns. Since then, there has




M. Han et al. 


been considerable research on the single and multiple order-preserving matching
problems. For the single order-preserving matching, Cho et al. [1] proposed a
method to apply the Boyer-Moore bad character rule to order-preserving matching
by using the notion of q-grams. Chhabra and Tarhio [3] presented a more
practical solution based on filtering. They first encoded input sequences into
binary sequences and then applied standard string matching algorithms as a
filtering method. Faro and Külekci [7] improved Chhabra and Tarhio's solution
by using new encoding techniques which reduced the false positive rate of the
filtering step. For the multiple order-preserving matching, Belazzougui et al.
[2] theoretically improved the solution of Kim et al. [11] by replacing the
underlying data structure by the van Emde Boas tree. They achieved randomized
$O(n \cdot \min(\log\log n, \sqrt{\log r / \log\log r}, k))$ time for the search, where r is the length of the
longest pattern and k is the number of patterns.

Order-preserving matching has been an active research area and many related
problems have been studied such as order-preserving suffix trees [6] and
order-preserving matching with k mismatches [8]. Kim et al. [10] extended the
representations of order relations from binary relations to ternary relations. With
their representations, one can modify earlier order-preserving matching
algorithms to accommodate strings with duplicate characters, i.e., a number can
appear more than once in a string.



[Plot: values of the text T and the pattern P by position.]


Fig. 1. An example of the order-preserving matching. P has the same relative orders 
as the substring T' = (20, 25,15, 30, 20) of T. 


In this paper, we present two new algorithms for the multiple order-preserving
matching problem which are more efficient on average than the previously
proposed algorithms. The algorithms are based on modifications of some
conventional pattern matching algorithms such as Wu-Manber and Karp-Rabin
[9]. The first algorithm, called Algorithm I, uses the ideas of the Wu-Manber
algorithm, and the second algorithm, called Algorithm II, uses the ideas of the
Karp-Rabin algorithm and the encoding techniques of Chhabra and Tarhio [3]
and Faro and Külekci [7] for fingerprinting. Algorithm I runs in $O(\frac{n}{m} \log M)$ time
on average, where n is the length of the text, m is the length of the shortest
pattern, and M is the sum of lengths of all the patterns. Algorithm II runs in $O(n)$
time on average, assuming that M is polynomial with respect to m. In order to
verify practical behaviors of our algorithms, we conducted experiments where
the two algorithms were compared with the algorithms of Kim et al. [11] and
Belazzougui et al. [2]. Experiments show that our algorithms run much faster in
practice.


2 Problem Formulation 

Let $\Sigma$ denote a set of numbers such that a comparison of two numbers can be
done in constant time, and let $\Sigma^*$ denote the set of strings over the alphabet $\Sigma$.

For a string $x \in \Sigma^*$, let $|x|$ denote the length of x. A string x is described by a
sequence of characters $(x[1], x[2], \ldots, x[|x|])$. Let a substring $x[i..j]$ be
$(x[i], x[i+1], \ldots, x[j])$ and a prefix $x_i$ be $x[1..i]$. For a character $c \in \Sigma$, let

$rank_x(c) = 1 + |\{i : x[i] < c \text{ for } 1 \le i \le |x|\}|$.

We use the natural representation defined by Kim et al. [11] to compare
order relations of two strings. The natural representation is equivalent to
order-isomorphism defined by Kubica et al. [13], because the natural representations
of two strings are identical if and only if they are order-isomorphic.

Definition 1 (Natural representation [11]). For a string x of length n, the
natural representation is defined as $Nat(x) = (rank_x(x[1]), rank_x(x[2]), \ldots, rank_x(x[n]))$.

For example, for x = (30,40,30,45,35), the natural representation is Nat(x) =
(1,4,1,5,3). We will simply say that x matches y if $|x| = |y|$ and $Nat(x) = Nat(y)$.
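The definition above can be illustrated with a minimal Python sketch. The function names `nat`, `op_match`, and `naive_search` are illustrative assumptions and do not come from the paper; the search is a brute-force baseline that recomputes the natural representation of every window and is not one of the algorithms proposed here.

```python
from bisect import bisect_left


def nat(x):
    """Natural representation: rank_x(c) = 1 + number of characters of x smaller than c."""
    ordered = sorted(x)
    # bisect_left returns how many elements of `ordered` are strictly smaller than c.
    return [1 + bisect_left(ordered, c) for c in x]


def op_match(x, y):
    """x matches y if the lengths agree and the natural representations coincide."""
    return len(x) == len(y) and nat(x) == nat(y)


def naive_search(text, pattern):
    """Brute-force baseline: 1-based end positions i with Nat(P) = Nat(T[i-m+1..i])."""
    m = len(pattern)
    target = nat(pattern)
    return [i for i in range(m, len(text) + 1) if nat(text[i - m:i]) == target]


if __name__ == "__main__":
    T = [10, 15, 20, 25, 15, 30, 20, 25, 30, 35]
    P = [35, 40, 30, 45, 35]
    print(nat([30, 40, 30, 45, 35]))  # [1, 4, 1, 5, 3]
    print(naive_search(T, P))         # [7], i.e. the window T[3..7]
```

The baseline costs roughly O(nm log m) time, which is the kind of cost the filtering techniques in the later sections are designed to avoid.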

Order-preserving matching can be defined in terms of the natural
representation.

Definition 2 (Single Order-Preserving Matching [11]). Given a text $T[1..n] \in \Sigma^*$
and a pattern $P[1..m] \in \Sigma^*$, P matches T at position i if
$Nat(P) = Nat(T[i-m+1..i])$. The order-preserving matching is the problem of finding all
the positions of T matched with P.

Definition 2 is naturally generalized to the multiple order-preserving matching,
formally defined in Definition 3.

Definition 3 (Multiple Order-Preserving Matching [11]). Given a text
$T[1..n] \in \Sigma^*$ and a set of patterns $\mathcal{P} = \{P_1, P_2, \ldots, P_k\}$ where $P_i \in \Sigma^*$ for
$1 \le i \le k$, the multiple order-preserving matching is the problem of finding all
the positions of T matched with any pattern in $\mathcal{P}$.
There are two other representations in addition to the natural representation
for comparing order relations of two strings: prefix representation and nearest
neighbor representation. The prefix representation can be defined as a sequence
of rank values of characters in prefixes.

Definition 4 (Prefix Representation [11]). For a string x, the prefix
representation is defined as $Pre(x) = (rank_{x_1}(x[1]), rank_{x_2}(x[2]), \ldots, rank_{x_{|x|}}(x[|x|]))$.


For example, for x = (30,40,30,45,35), the prefix representation is Pre(x) =
(1,2,1,4,3). We can compute Pre(x) in time $O(|x| \log |x|)$ for general alphabet
using the order-statistic tree. The time complexity can be reduced to $O(|x|)$
if the characters can be sorted in $O(|x|)$ time.
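As a rough illustration, the prefix representation can be computed by keeping the characters read so far in sorted order. The sketch below substitutes a plain sorted Python list for the order-statistic tree mentioned above, so each insertion may cost linear time; the name `pre` is an assumption made for this example.

```python
from bisect import bisect_left, insort


def pre(x):
    """Prefix representation: the rank of x[i] within the prefix x[1..i]."""
    seen = []  # characters of the current prefix, kept sorted
    out = []
    for c in x:
        insort(seen, c)  # the prefix now includes c itself
        # Number of prefix characters strictly smaller than c, plus one.
        out.append(1 + bisect_left(seen, c))
    return out


if __name__ == "__main__":
    print(pre([30, 40, 30, 45, 35]))  # [1, 2, 1, 4, 3]
```

Replacing the list with a balanced order-statistic tree would give the $O(|x| \log |x|)$ bound stated in the text.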

Lemma 1. For two strings x and y where $|x| = |y|$, if x matches y, then
$Pre(x) = Pre(y)$.


The prefix representation has an ambiguity between different strings if they
include duplicate characters. For example, when x = (10,30,20) and y =
(10,20,20), the prefix representations of both x and y are (1,2,2), whereas
their natural representations are different. Kim et al. defined a new
representation called the extended prefix representation [10] for strings with
duplicate characters. We omit the details here.

For the nearest neighbor representation, we define $LMax_x[i]$ and $LMin_x[i]$
as follows.

$LMax_x[i] = \begin{cases} j & \text{if } x[j] = \max\{x[k] : x[k] \le x[i] \text{ for } 1 \le k \le i-1\} \\ -\infty & \text{if no such } j, \end{cases}$

$LMin_x[i] = \begin{cases} j & \text{if } x[j] = \min\{x[k] : x[k] \ge x[i] \text{ for } 1 \le k \le i-1\} \\ \infty & \text{if no such } j. \end{cases}$

If there are multiple j's for $LMax_x[i]$ or $LMin_x[i]$, we choose the rightmost one.


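Definition 5 (Nearest Neighbor Representation [10,11]). For a string x,
the nearest neighbor representation is defined as

$NN(x) = \begin{pmatrix} LMax_x[1] & LMax_x[2] & \cdots & LMax_x[|x|] \\ LMin_x[1] & LMin_x[2] & \cdots & LMin_x[|x|] \end{pmatrix}$.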


For example, for x = (30,40,30,45,30), the nearest neighbor representation is
as follows.

i            1    2    3    4    5
x[i]        30   40   30   45   30
LMax_x[i]   -∞    1    1    2    3
LMin_x[i]    ∞    ∞    1    ∞    3

For convenience, let $x[-\infty] = -\infty$, $x[\infty] = \infty$, $Nat(x)[-\infty] = 0$ and
$Nat(x)[\infty] = |x| + 1$ for any string x. Then, $Nat(x)[LMax_x[i]] \le Nat(x)[i] \le Nat(x)[LMin_x[i]]$
holds for $1 \le i \le |x|$.

The time complexity for computing NN(x) is $O(|x| \log |x|)$ [11]. Using this
representation, we can check if two strings match in time linear to the size of
the input, even when the strings have duplicate characters.

Lemma 2. Given two strings x and y where $|x| = |y|$, assume
NN(x) is computed. Then we can determine whether x matches y in $O(|x|)$ time.
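The linear-time check behind Lemma 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the neighbors are defined with non-strict comparisons and rightmost tie-breaking, `nearest_neighbors` uses a simple quadratic scan instead of an order-statistic tree, and both function names are hypothetical. When LMax and LMin point to the same position, x[i] repeats an earlier character, so y[i] must equal the corresponding character of y; otherwise y[i] must lie strictly between its two neighbors.

```python
import math


def nearest_neighbors(x):
    """Return (lmax, lmin) as lists of 1-based indices; -inf / inf when undefined."""
    lmax, lmin = [], []
    for i in range(len(x)):
        lo, hi = -math.inf, math.inf
        for k in range(i):
            # '>=' and '<=' on the candidate values keep the rightmost index among ties.
            if x[k] <= x[i] and (lo == -math.inf or x[k] >= x[lo - 1]):
                lo = k + 1
            if x[k] >= x[i] and (hi == math.inf or x[k] <= x[hi - 1]):
                hi = k + 1
        lmax.append(lo)
        lmin.append(hi)
    return lmax, lmin


def matches_via_nn(lmax, lmin, y):
    """Check in O(|y|) time whether y matches the string whose NN is (lmax, lmin)."""
    for i in range(len(y)):
        lo, hi = lmax[i], lmin[i]
        if lo == hi:
            # Same finite index on both sides: the character is a duplicate.
            if y[i] != y[lo - 1]:
                return False
        else:
            if lo != -math.inf and not y[lo - 1] < y[i]:
                return False
            if hi != math.inf and not y[i] < y[hi - 1]:
                return False
    return True


if __name__ == "__main__":
    lmax, lmin = nearest_neighbors([30, 40, 30, 45, 30])
    print(lmax, lmin)
    print(matches_via_nn(lmax, lmin, [5, 9, 5, 12, 5]))  # same relative order
```

The caller is expected to pass a y of the same length as x, as in the statement of the lemma.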


3 Algorithm I 

In this section, we present our first algorithm for the multiple order-preserving 
matching. Algorithm I is based on the Wu-Manber algorithm, which is widely 
used for multiple pattern matching. Algorithm I is divided into two steps: the 
preprocessing step and the searching step. 

3.1 Preprocessing Step of Algorithm I 

Let m be the length of the shortest pattern, and M be the sum of lengths of all
the patterns. We consider only the first m characters of each pattern. Let
$\mathcal{P}' = \{P'_1, P'_2, \ldots, P'_k\}$ where $P'_i = P_i[1..m]$ (this notation is provided only for clarity
of exposition). In the preprocessing step, we build a SHIFT table and a HASH
table based on $\mathcal{P}'$, which are analogous to those of the Wu-Manber algorithm.
However, since we are looking for strings matched with patterns in terms of
order-preserving matching, we have to consider the order representations of strings
rather than strings themselves for comparison. Consider a block of length b on
the text, where $b \le m$. The SHIFT table determines the shift value based on the
prefix representation of the given block. Given a block x, we define

$l_x = \max\{j : Pre(P'_i[j-b+1..j]) = Pre(x) \text{ for } 1 \le i \le k,\ b \le j \le m\}$.

That is, $l_x$ means the position of the rightmost block in any $P'_i \in \mathcal{P}'$ which is
likely to match x. Here, the term "is likely to" is used because $Pre(x) = Pre(y)$
does not necessarily mean that x matches y. For convenience, let $l_x = -\infty$ if
there is no such block. Then, the SHIFT table is defined as

$SHIFT[f(x)] = \min(m - l_x,\ m - b + 1)$,

where f(x) is a fingerprint mapping a block x to an integer used as an index to
the SHIFT table. Using the factorial number system, we define f(x) as

$f(x) = \sum_{i=1}^{b} (Pre(x)[i] - 1) \cdot (i - 1)!$.

Note that f(x) maps a block x into a unique integer within the range $[0..b! - 1]$
according to its prefix representation.
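The fingerprint can be sketched in a few lines of Python. The helper names `pre` and `fingerprint` are illustrative assumptions; the sorted list again stands in for an order-statistic tree. Because $Pre(x)[i]$ always lies in $[1..i]$, the digit $Pre(x)[i] - 1$ is a valid digit of weight $(i-1)!$ in the factorial number system.

```python
from bisect import bisect_left, insort
from math import factorial


def pre(x):
    """Prefix representation of a block (sorted-list stand-in for an order-statistic tree)."""
    seen, out = [], []
    for c in x:
        insort(seen, c)
        out.append(1 + bisect_left(seen, c))
    return out


def fingerprint(block):
    """f(x) = sum over i of (Pre(x)[i] - 1) * (i - 1)!, an integer in [0, b! - 1]."""
    # enumerate() is 0-based, so the 1-based position i + 1 gets weight i!.
    return sum((r - 1) * factorial(i) for i, r in enumerate(pre(block)))


if __name__ == "__main__":
    # A block with prefix representation (1, 2, 2) maps to 0*0! + 1*1! + 1*2! = 3.
    print(fingerprint([10, 30, 20]))
```

For blocks of b distinct characters, the b! possible relative orders map onto the b! distinct table indices.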

Fig. 2(a) shows the SHIFT table when there are three patterns. Assume
that b = 3. Consider the block T[3..5]. The rightmost block in $\mathcal{P}'$ whose prefix
representation equals that of T[3..5] is $P_1[2..4]$. The fingerprint f(T[3..5]) is 3.
Thus, SHIFT[3] is m - 4 = 1. Note that in the figure, we can safely shift the
patterns by 1.
Fig. 2. SHIFT and HASH tables.


The fingerprint is also used to index the HASH table. HASH[i] contains a
pointer to the list of the patterns whose last block in $\mathcal{P}'$ is mapped to the
fingerprint i. Fig. 2(b) shows the HASH table with the same patterns.

To compute the values of the SHIFT table, we consider each pattern $P'_i$
separately. For each pattern $P'_i$, we compute the fingerprint of each block
$P'_i[j-b+1..j]$ consecutively, and set the corresponding value of the SHIFT table to
the minimum between its current value (initially set to m - b + 1) and m - j.
In order to obtain the fingerprint of a block, we have to compute its prefix
representation. Once we compute the prefix representation $Pre(P'_i[1..b])$ of the
first block using the order-statistic tree, the tree contains the first b characters
of $P'_i$. To compute the prefix representations of the subsequent blocks, we observe
that we can compute $Pre(P'_i[j+1..j+b])$ by taking advantage of the order-statistic
tree containing characters of the previous block $P'_i[j..j+b-1]$. Specifically, we
erase $P'_i[j]$ from the tree and insert the new character $P'_i[j+b]$ into the tree.
Inserting an element into or deleting an element from the order-statistic tree is
accomplished in $O(\log b)$ time since the tree contains $O(b)$ elements. Then we
traverse the tree in $O(b)$ time to retrieve the prefix representation of the new
block. We repeat this until we reach the last block. When we reach the last block,
we map its fingerprint into the HASH table and add $P_i$ into the corresponding list.
The whole process is performed for all the patterns. Since there are $O(km)$ blocks,
it takes $O(kmb)$ time to construct the SHIFT and HASH tables.

We also precompute the nearest neighbor representations of all the patterns,
namely, $NN(P_i)$ for $1 \le i \le k$. They are used in the searching step for verifying
whether patterns actually match the text. Using the order-statistic tree, they are
computed in $O(M \log r)$ time, where r denotes the length of the longest pattern.
As a result, the time complexity for the preprocessing step is $O(kmb + M \log r)$.


3.2 Searching Step of Algorithm I 

In the searching step, we find all the positions of T matched with any pattern in
$\mathcal{P}$. Fig. 3 shows the pseudocode of Algorithm I. For the search, we slide a position
pos along the text, reading a block of length b, T[pos - b + 1..pos], and computing
the corresponding fingerprint i. If SHIFT[i] > 0, then we shift the search window
to pos + SHIFT[i] and continue the search. Otherwise, SHIFT[i] = 0 and there
may be a match. Thus we select the list of patterns in HASH[i], and compare
each pattern in the list with the text via the nearest neighbor representation.
We call this process the verification step. We repeat this until we reach the end
of the text.


Algorithm I($\mathcal{P} = \{P_1, P_2, \ldots, P_k\}$, T[1..n])
    m ← min over 1 ≤ i ≤ k of |P_i|
    Preprocess $\mathcal{P}$ and compute SHIFT, HASH, NN
    pos ← m
    while pos ≤ n do
        i ← f(T[pos - b + 1..pos])
        if SHIFT[i] = 0 then
            Verify each pattern in HASH[i] via NN
            pos ← pos + 1
        else
            pos ← pos + SHIFT[i]
        end if
    end while


Fig. 3. The pseudocode of Algorithm I 
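The pseudocode can be turned into a small runnable Python sketch. Several simplifications are assumptions of this example rather than the paper's implementation: the block length `b` is a caller-supplied parameter, prefix representations are recomputed per block with a sorted list instead of being maintained in an order-statistic tree, and the verification step compares natural representations directly instead of using NN. Matches are reported as pairs (1-based start position, pattern index); all function names are hypothetical.

```python
from bisect import bisect_left, insort
from math import factorial


def pre(x):
    seen, out = [], []
    for c in x:
        insort(seen, c)
        out.append(1 + bisect_left(seen, c))
    return out


def fingerprint(block):
    return sum((r - 1) * factorial(i) for i, r in enumerate(pre(block)))


def nat(x):
    ordered = sorted(x)
    return [1 + bisect_left(ordered, c) for c in x]


def algorithm1(patterns, text, b):
    m = min(len(p) for p in patterns)
    assert 1 <= b <= m
    n = len(text)
    # Preprocessing: SHIFT defaults to m - b + 1; HASH is keyed by the last block of P'_i.
    shift = [m - b + 1] * factorial(b)
    hash_table = {}
    for idx, p in enumerate(patterns):
        for j in range(b, m + 1):  # block P'_i[j-b+1..j] in 1-based terms
            f = fingerprint(p[j - b:j])
            shift[f] = min(shift[f], m - j)
        hash_table.setdefault(fingerprint(p[m - b:m]), []).append(idx)
    targets = [nat(p) for p in patterns]
    # Searching: pos is the 1-based end of the current block.
    matches, pos = [], m
    while pos <= n:
        f = fingerprint(text[pos - b:pos])
        if shift[f] == 0:
            start = pos - m  # 0-based start of the candidate window
            for idx in hash_table.get(f, []):
                window = text[start:start + len(patterns[idx])]
                if len(window) == len(patterns[idx]) and nat(window) == targets[idx]:
                    matches.append((start + 1, idx))
            pos += 1
        else:
            pos += shift[f]
    return matches


if __name__ == "__main__":
    T = [10, 15, 20, 25, 15, 30, 20, 25, 30, 35]
    print(algorithm1([[35, 40, 30, 45, 35]], T, b=2))  # [(3, 0)]
```

The shift is safe because any occurrence whose window still contains the current block must align that block with a pattern block of equal prefix representation, which bounds the shift by m - j.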


3.3 Average Time for the Search of Algorithm I 

We present a simplified analysis of the average running time for the searching
step. For the analysis, we assume that there are no duplicate characters in any
b-length block in strings, i.e., any consecutive b characters in the text and patterns
are distinct. Although this assumption restricts the generality of our problem,
it is insignificant because: (1) a fairly large alphabet makes the case against
the assumption very unlikely to happen; (2) even if it happens, the algorithm
still works correctly without a significant impact on the performance in practice.
We leave it as an open problem whether the average $O(\frac{n}{m} \log M)$ time can be
derived when the strings are totally random, which is more complicated. Now,
we assume that each distinct block appears randomly at a given position (i.e.,
with the same probability). Let $\sigma = |\Sigma|$; then there are ${}_{\sigma}P_b$ different
possible blocks and the probability of a block to appear is $1/{}_{\sigma}P_b$.

Lemma 3. For two random blocks x and y, where $x, y \in \Sigma^b$ and each has no
duplicate characters, the probability that $Pre(x) = Pre(y)$ is $1/b!$.

Recall that Algorithm I determines a shift value according to the prefix
representation of a current block on the text.

Lemma 4. The probability that a random block x leads to a shift value of j,
$0 \le j \le m - b$, is at most $k/b!$.





Lemma 5. The expected value of a shift during the search is at least
$(m - b + 1)\left(1 - \frac{(m - b + 2)k}{2 \cdot b!}\right)$.
We set $b = 1.5 \log M / \log\log M$. Then, by Stirling's approximation, we
can easily prove that $b! = 2^{b \log b - b \log e + O(\log b)} = \Omega(M)$, and thus the expected
value of a shift is $\Omega(m)$. Consequently, the average number of iterations
of the while loop during the search is bounded by $O(n/m)$. At each iteration, we
compute a fingerprint and the computation takes $O(b \log b) = O(\log M)$ time.
Lemma 6 shows that the verification step at each iteration is accomplished in
constant time on average.

Lemma 6. The average cost of the verification step at each iteration is $O(1)$.

Hence, the average time complexity of the searching step is roughly $O(\frac{n}{m} \log M)$.

4 Algorithm II 

In this section, we present a simple algorithm that achieves average linear time
for search. Algorithm II exploits the ideas of the Karp-Rabin algorithm and the
encoding techniques of Chhabra and Tarhio [3] and Faro and Külekci [7] for
fingerprinting.


4.1 Fingerprinting in Algorithm II 

The Karp-Rabin algorithm is a practical string matching algorithm that makes
use of fingerprints to find patterns. It is important to choose a fingerprint
function whose fingerprints can be efficiently computed and efficiently
compared with other fingerprints. Furthermore, the fingerprint function should
be suitable for identifying strings in terms of order-preserving matching.

Given an m-length pattern P, Chhabra and Tarhio [3] encode the pattern
into a binary sequence $\beta(P)$ of length m - 1, where

$\beta(P)[i] = \begin{cases} 1 & \text{if } P[i] < P[i+1] \\ 0 & \text{otherwise.} \end{cases}$

We consider the fingerprint $\beta(P)$ as an (m - 1)-bit binary number. We can
compute $\beta(P)$ in time $O(m)$.
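A minimal Python sketch of the binary encoding follows. The function names `beta` and `as_number` are assumptions made for illustration; `as_number` reads the bit sequence as a binary number with the first bit as the most significant one.

```python
def beta(p):
    """Binary encoding: bit i is 1 if p[i] < p[i+1], and 0 otherwise."""
    return [1 if p[i] < p[i + 1] else 0 for i in range(len(p) - 1)]


def as_number(bits):
    """Interpret the bit sequence as a binary number (Horner's rule)."""
    value = 0
    for bit in bits:
        value = 2 * value + bit
    return value


if __name__ == "__main__":
    window = [5, 1, 3, 2, 4]
    print(beta(window))                 # [0, 1, 0, 1]
    print(as_number(beta(window)) % 7)  # 5
```

Two strings that match in the order-preserving sense always have the same encoding, while the converse does not hold, which is why a verification step is still required.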

As m increases, the fingerprint $\beta(P)$ may be too large to work with; we need
at least (m - 1) bits to represent a fingerprint. To address this issue, we compute
the fingerprints as residues modulo a prime number p. According to [3], we choose
the prime p pseudorandomly in the range $[1..mn^2]$. With this choice, it is proved
that the probability of a single false positive due to the modulo operation while
searching is bounded by $2.53/n$, which is negligibly small for sufficiently large n.

Faro and Külekci [7] proposed more advanced encoding techniques such as
q-NR and q-NO. Instead of comparing only a pair of neighboring characters,
they compared a set of q characters to compute the relative
position of a character. We can compute fingerprints using those techniques
similarly to above. We implemented Algorithm II using three encoding techniques,
including Chhabra and Tarhio's binary encoding [3], q-NR, and q-NO [7], for
fingerprinting. In the following sections, we will describe the algorithm assuming
the binary encoding.


4.2 Preprocessing Step of Algorithm II 

Again, let $\mathcal{P}' = \{P'_1, P'_2, \ldots, P'_k\}$ be the set of m-length prefixes of the patterns.

In the preprocessing step, we first compute $\beta(P'_i)$ for $1 \le i \le k$ and build a HASH
table. HASH[i] contains a pointer to the list of the patterns whose fingerprints
equal i. We also compute $NN(P_i)$ for $1 \le i \le k$. In total, the preprocessing step
takes $O(M \log r)$ time. Fig. 4(a) shows the HASH table when there are three
patterns. We use a prime p = 7 in the example.


(a) [HASH table figure]

(b)
i       1  2  3  4  5  6  7  8  9 10 11
T[i]    4  5  1  3  2  4  4  3  5  7  2
β(T)    1  0  1  0  1  0  0  1  1  0

For the window T[2..6]: (0101)_2 mod 7 = 5
potential matches: HASH[5] = P_1, P_2


Fig. 4. (a) The HASH table, (b) An example of the search. For the window T[2..6],
the corresponding fingerprint is 5. We check HASH[5], which has P_1, P_2 as elements,
and thus verify them via NN.


4.3 Searching Step of Algorithm II 

In the searching step, we scan the text T while iteratively computing fingerprints
of the successive windows of size m. Fig. 5 shows the pseudocode of Algorithm
II. We slide a search window T[i..i + m - 1] along the text, computing the
corresponding fingerprint β. If the list pointed to by HASH[β] is not empty, we compare
each pattern in the list with the text via its nearest neighbor representation. We
call this process the verification step. We repeat this until we reach the end of
the text. Fig. 4(b) shows an example of the searching step.
Algorithm II($\mathcal{P} = \{P_1, P_2, \ldots, P_k\}$, T[1..n])
    m ← min over 1 ≤ i ≤ k of |P_i|
    Preprocess $\mathcal{P}$ and compute HASH, NN
    pos ← m
    for i = 1 to n - m + 1 do
        β ← β(T[i..i + m - 1]) mod p
        Verify each pattern in HASH[β] via NN
    end for


Fig. 5. The pseudocode of Algorithm II 


4.4 Average Time for the Search of Algorithm II 

At each iteration of the for loop, we compute the fingerprint β of the search
window. Let us denote $\beta_i = \beta(T[i..i+m-1]) \bmod p$, which is the fingerprint
of the i-th search window. We can compute $\beta_1$ in time $O(m)$. To compute the
fingerprints for the subsequent windows, we observe that we can compute $\beta_{i+1}$
from $\beta_i$ using Horner's rule [5], since

$\beta_{i+1} = (2(\beta_i - H \cdot \beta(T)[i]) + \beta(T)[i+m-1]) \bmod p$,

where $H = 2^{m-2} \bmod p$ is a precomputed value. Here $\beta(T)[i]$ is the leading bit of
the i-th window and $\beta(T)[i+m-1]$ is the bit that enters the (i+1)-th window, since
each window of m characters is encoded by m - 1 bits. It is clear that this calculation
is done in constant time.
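The rolling update and the search loop of Algorithm II can be sketched together in Python. This is a hedged illustration and not the authors' C++ implementation: the modulus `p` is a fixed parameter here (the default value is an arbitrary choice for the example, whereas the text selects a prime pseudorandomly), verification compares natural representations instead of using NN, m is assumed to be at least 2, and all function names are hypothetical. With 0-based indexing, the window starting at i uses bits i..i+m-2, so the bit entering the next window is `bits[i + m - 1]`.

```python
from bisect import bisect_left


def nat(x):
    ordered = sorted(x)
    return [1 + bisect_left(ordered, c) for c in x]


def beta(p):
    return [1 if p[i] < p[i + 1] else 0 for i in range(len(p) - 1)]


def rolling_fingerprints(text, m, p):
    """Fingerprints of all windows of size m (requires len(text) >= m >= 2)."""
    bits = beta(text)
    cur = 0
    for bit in bits[:m - 1]:  # first window, by Horner's rule
        cur = (2 * cur + bit) % p
    out = [cur]
    H = pow(2, m - 2, p)  # weight of the leading bit of a window
    for i in range(len(text) - m):
        # Drop the leading bit, shift left, append the entering bit.
        cur = (2 * (cur - H * bits[i]) + bits[i + m - 1]) % p
        out.append(cur)
    return out


def algorithm2(patterns, text, p=1_000_003):
    m = min(len(q) for q in patterns)
    assert m >= 2
    if len(text) < m:
        return []
    table = {}
    for idx, q in enumerate(patterns):
        table.setdefault(rolling_fingerprints(q[:m], m, p)[0], []).append(idx)
    targets = [nat(q) for q in patterns]
    matches = []
    for start, fp in enumerate(rolling_fingerprints(text, m, p)):
        for idx in table.get(fp, []):
            window = text[start:start + len(patterns[idx])]
            if len(window) == len(patterns[idx]) and nat(window) == targets[idx]:
                matches.append((start + 1, idx))  # 1-based start position
    return matches


if __name__ == "__main__":
    T = [4, 5, 1, 3, 2, 4, 4, 3, 5, 7, 2]
    print(rolling_fingerprints(T, 5, 7)[1])  # 5, the window T[2..6] of Fig. 4(b)
```

Python's `%` operator returns a non-negative result even when the subtraction inside the update is negative, so no explicit correction is needed in this sketch.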

Now, we analyze the time spent to perform the verification step. We assume
that the numbers in the text and patterns are statistically independent and
uniformly random. The verification is performed when there is a match between
encoded binary strings of the text and patterns. The probability that a 1 appears
at a position of an encoded string is $q = (\sigma^2/2 - \sigma/2)/\sigma^2 = (\sigma - 1)/2\sigma$. So the
probability of a character match [5] is

$s = q^2 + (1 - q)^2 = \frac{1}{2} + \frac{1}{2\sigma^2}$.

Since the odd positions of an encoded string are mutually independent, we
can (upper) bound the probability of a match between two encoded strings by
$s^{(m-1)/2}$. Note that $s \le 5/8$ for $\sigma \ge 2$.
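The closed form of s can be checked by expanding the definitions step by step. The derivation below only restates the quantities q and s from the text and assumes $\sigma \ge 2$ for the final bound.

```latex
\begin{align*}
q     &= \frac{\sigma(\sigma-1)/2}{\sigma^2} = \frac{\sigma-1}{2\sigma},
\qquad
1 - q = \frac{\sigma+1}{2\sigma}, \\
s     &= q^2 + (1-q)^2
       = \frac{(\sigma-1)^2 + (\sigma+1)^2}{4\sigma^2}
       = \frac{2\sigma^2 + 2}{4\sigma^2}
       = \frac{1}{2} + \frac{1}{2\sigma^2}, \\
s     &\le \frac{1}{2} + \frac{1}{8} = \frac{5}{8}
       \quad \text{for } \sigma \ge 2 .
\end{align*}
```

The bound is attained at $\sigma = 2$, and s approaches 1/2 as the alphabet grows.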

Lemma 7. When M is polynomial with respect to m, the average cost of the
verification step during the search is $O(1)$.

Hence, the average time complexity of the searching step is $O(n)$, when M is
polynomial with respect to m.

5 Experiments 

In order to verify the practical behaviors of our algorithms, we tested them
against the previous algorithms based on the Aho-Corasick algorithm: Kim et
al.'s [11] and Belazzougui et al.'s [2].^1 Kim et al.'s algorithm is denoted by KEF,
Belazzougui et al.'s by BPR, and Algorithm I by Alg1. Algorithm II is denoted by
Alg2, followed by a notation of the encoding technique adopted for fingerprinting.
Specifically, Alg2-Bin refers to Algorithm II with Chhabra and Tarhio's binary
encoding [3], and Alg2-NR2 (resp. Alg2-NO2) refers to Algorithm II with the
q-NR (resp. q-NO) encoding of Faro and Külekci [7] where we set q = 2. All
algorithms were implemented in C++ and run on Debian Linux 7 (64-bit) with an
Intel Xeon X5672 processor and 32 GB RAM. During the compilation, we used
the O3 optimization option.

^1 For the implementation of the van Emde Boas tree used in [2], we used the source
code publicly available at https://code.google.com/p/libveb/.

We tested for a random text T of length n = 10® searched for k = 10, 50,100 
random patterns of length m = 5, 10, 20, 50, 100, respectively. All the texts and
patterns were selected randomly from an integer alphabet $\Sigma = \{1, 2, \ldots, 1000\}$
(we tested for varying alphabet sizes, but we did not observe noticeable differences
in the results). For each combination of k and m, we randomly selected a text
and patterns, and then ran each algorithm. We performed this 10 times and
measured the average time for the searching step. Table 1 shows the results.

Fig. 6 in the Appendix shows the average search times when k = 10. When m
is less than 50, Alg2-Bin is the best among the algorithms, achieving a speedup
of about 6 times compared to KEF, and about 14 times compared to BPR.
As m increases, however, Alg1 becomes better, achieving a speedup of about
11 times compared to KEF, and about 24 times compared to BPR. This is due
to the increase of the average shift value during the search. The average shift
value increases because, since we set $b = 1.5 \log M / \log\log M$,
the block size increases as m increases, and thus the probability that a block
appears in the patterns decreases. Figs. 7 and 8 in the Appendix show the average
search times when k = 50, 100. They show trends similar to Fig. 6. One thing
to note is that as k increases, the value of m at which Alg1 first becomes
faster than the Alg2 family increases. We attribute this to the fact that
as k increases, a block appears more often in the patterns, which leads to lower
shift values.


6 Conclusion 


We proposed two efficient algorithms for the multiple order-preserving matching
problem. Algorithm I is based on the Wu-Manber algorithm, and Algorithm II is
based on the Karp-Rabin algorithm and exploits the encoding techniques of the
previous works [3,7]. Algorithm I performs the multiple order-preserving matching
in average $O(\frac{n}{m} \log M)$ time, and Algorithm II performs it in average $O(n)$
time when M is polynomial with respect to m. The experimental results show
that both of the algorithms are much faster than the existing algorithms. When
the lengths of the patterns are relatively short, Algorithm II with the binary
encoding performs the best due to its inherent simplicity. However, Algorithm I
becomes more efficient as the lengths of the patterns grow.
Table 1. Average search times with different values for k and m.

  k     m     KEF      BPR     Alg1   Alg2-Bin   Alg2-NR2   Alg2-NO2
 10     5   527.3   1215.1    274.8      107.6      164.8      186.5
 10    10   544.3   1258.2    216.9       91.5      148.8      197.6
 10    20   557.1   1254.8    286.5       88.4      155.4      194.8
 10    50   556.2   1213       51.1       65.4      116.7      203
 10   100   561.7   1244.8     56.2       70.4      123.9      206.9
 50     5   598     1227.2    647.8      234.1      215.3      310
 50    10   573.8   1238.6    269.6      100.5      152.3      194.6
 50    20   562.9   1244.2    308.6      114.8      187.1      216.4
 50    50   570.7   1239.8    313.5      113.8      184.5      226.2
 50   100   587.6   1271.8     55.4       86.1      150.6      227.6
100     5   569     1291.1    674.4      395.6      386.3      307.3
100    10   629     1304.3    522.4       81.9      100.3      150.5
100    20   589     1250.2    498.4      102.6      164        205.1
100    50   605.3   1259.9    103.3       86        184.8      225.7
100   100   588.9   1247.2     73.2       53.8      182        227.5
Appendix

Proof of Lemma 3. The probability is equivalent to the probability that when
randomly choosing a block x, its prefix representation is the same as that of
(already chosen) y. The sample space consists of all ${}_{\sigma}P_b$ possible blocks. Notice
that once we choose b distinct characters regardless of order, we can order them
to fit in any one of the b! prefix representations. It means that there are $\binom{\sigma}{b}$ blocks
belonging to each prefix representation. Therefore, the probability is
$\binom{\sigma}{b} \cdot \frac{1}{{}_{\sigma}P_b} = \frac{1}{b!}$. □

Proof of Lemma 4. The necessary condition for the case that x leads to a
shift value j is that there exists a pattern $P'_i$ whose block ending at the position
m - j belongs to the prefix representation of x. Since there are k patterns, the
probability of the necessary condition is at most $k/b!$. □

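Proof of Lemma 5. Since all the entries of SHIFT were initialized to m - b + 1,
and by Lemma 4 each shift value j, $0 \le j \le m - b$, occurs with probability at most
$k/b!$, the expected value of a shift is

$\ge \sum_{j=0}^{m-b} j \cdot \frac{k}{b!} + (m - b + 1)\left(1 - \sum_{j=0}^{m-b} \frac{k}{b!}\right) = (m - b + 1)\left(1 - \frac{(m - b + 2)k}{2 \cdot b!}\right)$. □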

Proof of Lemma 6. At each iteration, the probability that a pattern leads
to the verification step is $1/b!$ and the cost for the verification for $P_i$ is $O(|P_i|)$ by
Lemma 2. Since there are k patterns, the expected cost of the verification step
at each iteration is $\sum_{i=1}^{k} \frac{1}{b!} \cdot O(|P_i|) = O(M/b!) = O(1)$, because $b! = \Omega(M)$. □

Proof of Lemma 7. At each iteration, the probability that a pattern $P_i$ leads
to the verification is at most $s^{(m-1)/2}$. Thus, the expected cost of the verification
step is at most $\sum_{i=1}^{k} s^{(m-1)/2} \cdot O(|P_i|) = O((\frac{5}{8})^{m/2} \cdot M)$, which is $O(1)$ when M
is polynomial with respect to m. □
Fig. 6. Average search times when k = 10.



Fig. 7. Average search times when k = 50. 
Fig. 8. Average search times when k = 100.












